Rhys Hewer
REMEMBER!! Combine the data sets at the end!
#load libraries
library(readxl)
library(dplyr)
library(ggplot2)
library(plotly)
library(RColorBrewer)
library(corrplot)
library(caret)
library(tidyr)
library(kableExtra)
#read in data
setwd("C:/Users/rhysh/Google Drive/Data Science/Ubiqum/Project 2/Task 2")
incomplete <- read.csv("SurveyIncomplete.csv")
origdata <- read_xlsx("Survey_Key_and_Complete_Responses_excel.xlsx", sheet = 2)
data <- origdata
As we are working across 2 spreadsheets it is important to check to see if they are structured alike or whether manipulation will be required to ensure they have the same features.
#Check structure of data to see if alike
names(incomplete) == names(data)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
The features match between the spreadsheets so no data manipulation is needed in that respect.
incomplete %>% sample_n(5)
## salary age elevel car zipcode credit brand
## 4719 135971.94 41 4 12 0 31624.42 0
## 229 123414.45 51 0 15 2 126588.87 0
## 1228 102992.34 61 3 3 2 425504.16 0
## 1562 24635.88 52 1 11 7 409696.47 0
## 758 59821.14 23 3 19 6 194678.71 0
data %>% sample_n(5)
## # A tibble: 5 x 7
## salary age elevel car zipcode credit brand
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 29569. 78 3 6 6 312884. 0
## 2 79923. 46 1 4 2 498929. 0
## 3 51553. 27 2 11 5 411691. 0
## 4 63190. 35 0 16 5 249350. 0
## 5 36789. 68 0 11 7 302899. 0
A quick look at a sample from both spreadsheets shows that they are very similar in composition.
str(incomplete)
## 'data.frame': 5000 obs. of 7 variables:
## $ salary : num 110500 140894 119160 20000 93956 ...
## $ age : int 54 44 49 56 59 71 32 33 32 58 ...
## $ elevel : int 3 4 2 0 1 2 1 4 1 2 ...
## $ car : int 15 20 1 9 15 7 17 17 19 8 ...
## $ zipcode: int 4 7 3 1 1 2 1 0 2 4 ...
## $ credit : num 354724 395015 122025 99630 458680 ...
## $ brand : int 0 0 0 0 0 0 0 0 0 0 ...
str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 10000 obs. of 7 variables:
## $ salary : num 119807 106880 78021 63690 50874 ...
## $ age : num 45 63 23 51 20 56 24 62 29 41 ...
## $ elevel : num 0 1 0 3 3 3 4 3 4 1 ...
## $ car : num 14 11 15 6 14 14 8 3 17 5 ...
## $ zipcode: num 4 6 2 5 4 3 5 0 0 4 ...
## $ credit : num 442038 45007 48795 40889 352951 ...
## $ brand : num 0 1 0 1 0 1 1 1 0 1 ...
Looking at the structure shows that a few changes are needed in respect of the data types.
Overall, however, the datasets are sufficiently similar. I will take the strategy of splitting into training, test and validation sets. The training and testing sets will come from the complete responses data. Validation from the incomplete responses.
The exploratory data analysis will be performed on the complete data but any data transformations taking place on the training/testing sets will also need to be made on the validation data prior to modelling this.
#check for NAs
data %>% is.na() %>% sum()
## [1] 0
incomplete %>% is.na() %>% sum()
## [1] 0
There are no missing values in either data set.
#data types
data$elevel <- data$elevel %>% as.factor()
data$car <- data$car %>% as.factor()
data$zipcode <- data$zipcode %>% as.factor()
data$brand <- data$brand %>% as.factor()
Education level, car owned, zip code and brand preference are converted from numeric to factors.
##check for outliers
numericVars <- Filter(is.numeric, data)
outliers <- numericVars %>% sapply(function(x) boxplot(x, plot=FALSE)$out) %>% str()
## List of 3
## $ salary: num(0)
## $ age : num(0)
## $ credit: num(0)
There are no outliers.
#EDA
##Key feature is brand preference, begin with exploring this value.
g6 <- ggplot(data, aes(brand, fill = brand)) +
geom_bar() +
theme_bw() +
scale_fill_brewer(palette="Dark2") +
xlab("Brand Preference") +
ylab("Frequency") +
ggtitle("Brand Preference Frequencies")
g6
We see a clear preference for brand 1 over brand 0.
##review other histograms for skewdness
histData <- origdata %>% select(-brand)
g8 <- ggplot(gather(histData), aes(value)) +
geom_histogram(bins = 10, fill = "#D95F02", colour = "white") +
theme_bw() +
facet_wrap(~key, scales = 'free_x') +
xlab("Value") +
ylab("Count") +
ggtitle("Histograms of Numeric Variables")
g8
Across the remaining features we see no extreme skewing or noteworthy patterns.
## Plot Brand choice v other variables
g1 <- ggplot(data, aes(brand, salary, fill = brand)) +
geom_violin() +
theme_bw() +
scale_fill_brewer(palette="Dark2") +
xlab("Brand Preference") +
ylab("Salary ($)") +
ggtitle("Brand Preference v Salary")
g1 <- ggplotly(g1)
g1
There is a clear pattern: salaries ranging between 45k - 100k seem to have a preference for brand 0. Salaries outside of this range seem to have a preference for brand 1.
g2 <- ggplot(data, aes(brand, age, fill = brand)) +
geom_violin() +
theme_bw() +
scale_fill_brewer(palette="Dark2") +
xlab("Brand Preference") +
ylab("Age") +
ggtitle("Brand Preference v Age")
g2 <- ggplotly(g2)
g2
There is no noteworthy pattern between brand preference and age.
g3 <- ggplot(data, aes(brand, elevel, colour = brand)) +
geom_count() +
theme_bw() +
scale_color_brewer(palette="Dark2") +
xlab("Brand Preference") +
ylab("Education Level") +
ggtitle("Brand Preference v Education Level")
g3
The general preference for brand 1 is shown in this plot but there also seems to be a consistent preference for brand 1 regardless of education level whereas there seems to be more variation within preference for brand 0 based on the education level.
g4 <- ggplot(data, aes(brand, car, colour = brand)) +
geom_count() +
theme_bw() +
scale_color_brewer(palette="Dark2") +
xlab("Brand Preference") +
ylab("Car") +
ggtitle("Brand Preference v Car")
g4
The general preference for brand 1 is shown in this plot but there also seems to be a consistent preference for brand 1 regardless of car whereas there seems to be more variation within preference for brand 0 based on the car owned.
g5 <- ggplot(data, aes(brand, zipcode, colour = brand)) +
geom_count() +
theme_bw() +
scale_color_brewer(palette="Dark2") +
xlab("Brand Preference") +
ylab("Zip Code") +
ggtitle("Brand Preference v Zip Code")
g5
The general preference for brand 1 is shown in this plot but there also seems to be a consistent preference for brand 1 regardless of zip code whereas there seems to be more variation within preference for brand 0 based on the car owned.
g7 <- ggplot(data, aes(brand, credit, fill = brand)) +
geom_violin() +
theme_bw() +
scale_fill_brewer(palette="Dark2") +
xlab("Brand Preference") +
ylab("Credit") +
ggtitle("Brand Preference v Credit")
g7 <- ggplotly(g7)
g7
There is no noteworthy pattern between brand preference and credit.
##Review correlation matrix
corrMatrix <- origdata %>% cor()
corrMatrix %>% corrplot.mixed()
Reviewing the correlation plot we see very little correlation between the features. This means that colinearity is not an issue that needs to be addressed.
#near zero variance
nzv <- data %>% nearZeroVar(saveMetrics = TRUE)
nzv
## freqRatio percentUnique zeroVar nzv
## salary 1.112069 97.53 FALSE FALSE
## age 1.178744 0.61 FALSE FALSE
## elevel 1.035946 0.05 FALSE FALSE
## car 1.026415 0.20 FALSE FALSE
## zipcode 1.020105 0.09 FALSE FALSE
## credit 1.057377 97.51 FALSE FALSE
## brand 1.643405 0.02 FALSE FALSE
There are no features with near zero variance.
There is a general preference for brand 1 and salary seems the key feature.
Age and credit seem to have very limited influence on brand preference. Whilst there are patterns of preference within education level, car and zip code, salary appears to have the most striking impact.
Salaries ranging between 45k - 100k seem to have a preference for brand 0. Salaries outside of this range seem to have a preference for brand 1.
Normalisation of data